Programming with R {dplyr} - As I Understand It!!

Rstats
dplyr
functions

How to create your own functions using {dplyr}

Vishal Katti
07-17-2021

Introduction

The purpose of this document is to act as a quick guide for myself and others to understand how to use dplyr effectively to create dynamic functions. The general assumption is that the reader is familiar with the {dplyr} package and how to use it for data wrangling.

In this document, we will explore how to create functions using the popular dplyr verbs like select, filter, mutate, arrange and finally group_by with summarise.

Inspiration

I regularly deal with event-related information with event date and few other columns like event type, root cause etc. Most reports usually involve calculating number of events that took place on a monthly, quarterly or annual basis, sometimes split by event type, root cause and other columns. After a few reports I realized that I am basically writing the same code over and over again to calculate these KPIs. Keeping the DRY (Don't Repeat Yourself) principle in mind, I managed to write a few functions to calculate these KPIs with a few dynamic variables. Following is an attempt to articulate what I learnt while creating those functions.

Data

We shall use the Texas Housing Sales data, available as a tibble in the popular ggplot2 package as reference data. It contains monthly information about the housing market in Texas provided by the TAMU real estate center, https://www.recenter.tamu.edu/. It has 8602 observations and 9 variables.

txhousing <- ggplot2::txhousing
dplyr::glimpse(txhousing)
Rows: 8,602
Columns: 9
$ city      <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abile~
$ year      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20~
$ month     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4,~
$ sales     <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100~
$ volume    <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 1391~
$ median    <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 7~
$ listings  <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, ~
$ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, ~
$ date      <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, ~

We shall refer the above data in all the following sections.

select()

When using dplyr functions, the two most popular ways to pass column names is either as bare names i.e. column names without enclosing them in quotes like sales or volume OR pass them as a character string like “sales” or ‘volume’. You could also pass a character vector like c("sales", "volume"). In this section we will explore the 3 ways to dynamically select the columns we want.

Passing raw column names

In this method, we pass the raw name of the column we want to select and use the embrace of curly-curly brackets to pass the raw name. For multiple columns, we can pass the raw names as a single vector.

select_raw <- function(df, var) {
  dplyr::select(.data = df, {{var}}) %>%     # embrace of curly-curly {{}} brackets
    head()                                   # to limit the number of output rows in this example.
}
select_raw(txhousing, sales)                 # pass single raw name
# A tibble: 6 x 1
  sales
  <dbl>
1    72
2    98
3   130
4    98
5   141
6   156
select_raw(txhousing, c(sales, volume))      # pass a vector of raw names for multiple columns
# A tibble: 6 x 2
  sales   volume
  <dbl>    <dbl>
1    72  5380000
2    98  6505000
3   130  9285000
4    98  9730000
5   141 10590000
6   156 13910000

If passing multiple raw names as vector as in the select_raw() feels like an unnecessary complication, try the next method.

Passing multiple raw column names using … argument

In this method, we use the . argument to pass the raw names of the columns we want to select.

my_select <- function(df, ...) {
  dplyr::select(.data = df, ...) %>% 
    head()
}

my_select(txhousing, sales, volume)          # pass multiple raw names directly
# A tibble: 6 x 2
  sales   volume
  <dbl>    <dbl>
1    72  5380000
2    98  6505000
3   130  9285000
4    98  9730000
5   141 10590000
6   156 13910000

Passing a character vector of column names

If we have the column names as a character vector, we use the all_of function to pass the character vector to the internal select function.

my_select_char <- function(df, cols) {
  dplyr::select(.data = df, dplyr::all_of(cols)) %>% 
    head()
}

my_cols <- c("sales","volume")
my_select_char(txhousing, my_cols)
# A tibble: 6 x 2
  sales   volume
  <dbl>    <dbl>
1    72  5380000
2    98  6505000
3   130  9285000
4    98  9730000
5   141 10590000
6   156 13910000

filter()

In the previous section, we passed column names either as bare names or character strings. filter() takes one or more expressions/conditions that result in a logical vector, with same length as number of rows in the data.frame/tibble and returns only those rows for which the expression/condition returns TRUE. Following are 2 ways to pass these logical expressions/conditions. I’m using expression and condition interchangeably here. In this context, a condition is an expression that results in a boolean TRUE/FALSE result.

Passing single raw criteria

In this method, we pass the condition sales > 8000 as a raw/bare expression.

filter_raw <- function(df, cond) {
  dplyr::filter(.data = df, {{cond}})        # embrace of curly-curly {{}} brackets
}

filter_raw(txhousing, sales > 8000)          # Pass a single raw criterion
# A tibble: 10 x 9
   city     year month sales    volume median listings inventory  date
   <chr>   <int> <int> <dbl>     <dbl>  <dbl>    <dbl>     <dbl> <dbl>
 1 Houston  2006     5  8040    1.60e9 151200    35398       5.5 2006.
 2 Houston  2006     6  8628    1.80e9 155200    36281       5.6 2006.
 3 Houston  2013     5  8439    2.12e9 186100    20526       3.3 2013.
 4 Houston  2013     7  8468    2.17e9 187800    21497       3.3 2014.
 5 Houston  2013     8  8155    2.08e9 186700    21366       3.3 2014.
 6 Houston  2014     6  8391    2.34e9 211200    19725       2.9 2014.
 7 Houston  2014     7  8391    2.28e9 199700    20214       3   2014.
 8 Houston  2014     8  8167    2.20e9 202400    20007       2.9 2015.
 9 Houston  2015     6  8449    2.49e9 222400    22311       3.2 2015.
10 Houston  2015     7  8945    2.57e9 217600    23875       3.4 2016.

Do you think we can pass multiple bare conditions as a vector, like we did for select_raw() in the previous section? Let us try passing multiple raw criteria as a vector.

filter_raw(txhousing, c(sales > 8000, year > 2010))
Error in `dplyr::filter()`:
! Problem while computing `..1 = c(sales > 8000, year >
  2010)`.
x Input `..1` must be of size 8602 or 1, not size 17204.

Passing multiple raw criteria as a vector doesn’t work like it works for select_raw() function. Let us understand why. Consider the following code:

A <- c(TRUE, TRUE)                           # boolean vector of length = 2
B <- c(FALSE, FALSE)                         # boolean vector of length = 2
X <- c(A, B)
X
[1]  TRUE  TRUE FALSE FALSE

Notice that length of X is 4. Similarly, sales > 8000 evaluates to a TRUE/FALSE boolean vector of length 8602 (equal to number of rows in txhousing) and so does year > 2010. So the vector c(sales > 8000, year > 2010) becomes a TRUE/FALSE boolean vector of length 17204, which results in an error.

Passing multiple raw criteria using … argument

To pass multiple raw criteria, we can use the ... argument.

my_filter <- function(df, ...) { 
  dplyr::filter(.data = df, ...)                # pass the dots argument
  }

my_filter(txhousing, sales > 8000, year > 2010) # pass multiple raw criteria
# A tibble: 8 x 9
  city     year month sales     volume median listings inventory  date
  <chr>   <int> <int> <dbl>      <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Houston  2013     5  8439 2121508529 186100    20526       3.3 2013.
2 Houston  2013     7  8468 2168720825 187800    21497       3.3 2014.
3 Houston  2013     8  8155 2083377894 186700    21366       3.3 2014.
4 Houston  2014     6  8391 2342443127 211200    19725       2.9 2014.
5 Houston  2014     7  8391 2278932511 199700    20214       3   2014.
6 Houston  2014     8  8167 2195184825 202400    20007       2.9 2015.
7 Houston  2015     6  8449 2490238594 222400    22311       3.2 2015.
8 Houston  2015     7  8945 2568156780 217600    23875       3.4 2016.

Passing single criteria as a character string

By default, dplyr::filter() does not accept conditions as character strings. Following is an example which results in error

dplyr::filter(txhousing, "sales > 8000")
Error in `dplyr::filter()`:
! Problem while computing `..1 = "sales > 8000"`.
x Input `..1` must be a logical vector, not a character.

We need to convert the character condition into a raw expression.

my_filter_string <- function(df, cond) {
  dplyr::filter(.data = df, eval(parse(text = cond)))   # convert text to raw criterion
}

my_filter_string(txhousing, "sales > 8000")             # pass single text string as criteria
# A tibble: 10 x 9
   city     year month sales    volume median listings inventory  date
   <chr>   <int> <int> <dbl>     <dbl>  <dbl>    <dbl>     <dbl> <dbl>
 1 Houston  2006     5  8040    1.60e9 151200    35398       5.5 2006.
 2 Houston  2006     6  8628    1.80e9 155200    36281       5.6 2006.
 3 Houston  2013     5  8439    2.12e9 186100    20526       3.3 2013.
 4 Houston  2013     7  8468    2.17e9 187800    21497       3.3 2014.
 5 Houston  2013     8  8155    2.08e9 186700    21366       3.3 2014.
 6 Houston  2014     6  8391    2.34e9 211200    19725       2.9 2014.
 7 Houston  2014     7  8391    2.28e9 199700    20214       3   2014.
 8 Houston  2014     8  8167    2.20e9 202400    20007       2.9 2015.
 9 Houston  2015     6  8449    2.49e9 222400    22311       3.2 2015.
10 Houston  2015     7  8945    2.57e9 217600    23875       3.4 2016.

The special sauce here is the eval(parse(text = ...)) combo that converts the long text criteria into a single raw criteria and passes it to the internal filter() function.

Passing multiple criteria as character vector

What if want to pass multiple criteria as a string vector? In such a situation, we must combine all the string conditions into a single long string condition using paste0(..., collapse = " & "). The paste0("(", cond, ")", collapse = " & ") combines all the criteria into a single long criteria, but still a text string.

my_filter_strings <- function(df, cond) { 
  filter_text <- paste0("(", cond, ")", collapse = " & ")   # combine all criteria
  message("Filter Condition: ", filter_text)                # (OPTIONAL) show the combined filter string
  dplyr::filter(.data = df, eval(parse(text = filter_text)))# convert text to raw criterion
  }

my_filter_criteria <- c("sales > 8000", "year > 2010")
my_filter_strings(txhousing, my_filter_criteria)
# A tibble: 8 x 9
  city     year month sales     volume median listings inventory  date
  <chr>   <int> <int> <dbl>      <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Houston  2013     5  8439 2121508529 186100    20526       3.3 2013.
2 Houston  2013     7  8468 2168720825 187800    21497       3.3 2014.
3 Houston  2013     8  8155 2083377894 186700    21366       3.3 2014.
4 Houston  2014     6  8391 2342443127 211200    19725       2.9 2014.
5 Houston  2014     7  8391 2278932511 199700    20214       3   2014.
6 Houston  2014     8  8167 2195184825 202400    20007       2.9 2015.
7 Houston  2015     6  8449 2490238594 222400    22311       3.2 2015.
8 Houston  2015     7  8945 2568156780 217600    23875       3.4 2016.
my_filter_criteria_with_OR <- c("sales > 8000 | sales < 50", "year > 2010")

# NOTE: OR criteria must be a single string separated by pipe '|' as in example below.
my_filter_strings(txhousing, my_filter_criteria_with_OR)
# A tibble: 315 x 9
   city        year month sales volume median listings inventory  date
   <chr>      <int> <int> <dbl>  <dbl>  <dbl>    <dbl>     <dbl> <dbl>
 1 Brownsvil~  2011     1    48 4.97e6  83300      784      12.6 2011 
 2 Brownsvil~  2011     2    47 5.56e6 101400      776      12.7 2011.
 3 Brownsvil~  2011     7    47 4.81e6  91200      749      13.1 2012.
 4 Brownsvil~  2011    12    39 4.20e6  86800      726      12.4 2012.
 5 Brownsvil~  2012     1    43 3.89e6  85000      791      13.6 2012 
 6 Brownsvil~  2012     3    27 2.98e6  93800      734      13.3 2012.
 7 Brownsvil~  2012    11    41 5.12e6  99000      807      14   2013.
 8 Brownsvil~  2013    11    38 4.82e6 108000      859      13.4 2014.
 9 Brownsvil~  2015     1    41 5.40e6  97000      733      10.7 2015 
10 Galveston   2011     1    43 8.88e6 170000     1015      13.7 2011 
# ... with 305 more rows

mutate()

mutate() allows you to add new columns or modify existing columns. In the example below, we will create a new column volume_in_millions from the existing column volume. The names of both the columns can be passed to the function either as raw names or character strings.

Passing the column name as raw name

mutate_raw <- function(df, new_col_raw, old_col_raw, num = 1) { 
  dplyr::mutate(.data = df, {{new_col_raw}} := {{old_col_raw}}/num) %>% 
    head()
}

txhousing %>% 
  select(city, year, month, volume) %>% 
  mutate_raw(vol_in_millions, volume, 1E6) # pass raw column names w/o quotes
# A tibble: 6 x 5
  city     year month   volume vol_in_millions
  <chr>   <int> <int>    <dbl>           <dbl>
1 Abilene  2000     1  5380000            5.38
2 Abilene  2000     2  6505000            6.50
3 Abilene  2000     3  9285000            9.28
4 Abilene  2000     4  9730000            9.73
5 Abilene  2000     5 10590000           10.6 
6 Abilene  2000     6 13910000           13.9 

Passing the new variable name as character string (direct)

mutate_text <- function(df, new_col_str, old_col_str, num = 1) { 
  dplyr::mutate(.data = df, {{new_col_str}} := df[[old_col_str]]/num) %>% 
    head()
}

txhousing %>% 
  select(city, year, month, volume) %>%
  mutate_text("vol_in_millions", "volume", 1E6) # pass column names as strings
# A tibble: 6 x 5
  city     year month   volume vol_in_millions
  <chr>   <int> <int>    <dbl>           <dbl>
1 Abilene  2000     1  5380000            5.38
2 Abilene  2000     2  6505000            6.50
3 Abilene  2000     3  9285000            9.28
4 Abilene  2000     4  9730000            9.73
5 Abilene  2000     5 10590000           10.6 
6 Abilene  2000     6 13910000           13.9 

Passing the new variable name as character string (indirect)

Instead of passing the name of the variable as a character string as an argument, we can pass a variable containing the name of the variable. In the below example, the name of the new variable is stored in new_var. Using the new {glue} syntax, enabled by the walrus operator :=, we substitute the new_var variable with its value.

mutate_var <- function(df, new_col_var, old_col_var, num = 1) {
  dplyr::mutate(.data = df, "{new_col_var}" := df[[old_col_var]]/num) %>% 
    head()
}

new_var <- "vol_in_millions"
old_var <- "volume"

txhousing %>% 
  select(city, year, month, volume) %>%
  mutate_var(new_var, old_var, 1E6)  # pass column names as variables
# A tibble: 6 x 5
  city     year month   volume vol_in_millions
  <chr>   <int> <int>    <dbl>           <dbl>
1 Abilene  2000     1  5380000            5.38
2 Abilene  2000     2  6505000            6.50
3 Abilene  2000     3  9285000            9.28
4 Abilene  2000     4  9730000            9.73
5 Abilene  2000     5 10590000           10.6 
6 Abilene  2000     6 13910000           13.9 

arrange()

arrange() sorts the rows of a data frame by the values of selected columns. By default, it sorts in Ascending order. To force a column to sort in Descending order, we must use the desc() function.

Passing single raw name

arrange_raw <- function(df, var) {
  dplyr::arrange(.data = df, {{var}}) %>%    # embrace of curly-curly {{}} brackets
    head()
}

arrange_raw(txhousing, sales)
# A tibble: 6 x 9
  city         year month sales volume median listings inventory  date
  <chr>       <int> <int> <dbl>  <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 San Marcos   2011    10     6 1.16e6 180000      163       8.3 2012.
2 Harlingen    2000     7     9 1.11e6  87500      719      30.8 2000.
3 South Padr~  2011     1     9 2.09e6 225000     1258      55.7 2011 
4 San Marcos   2011     1    10 1.48e6 140000      165       7.5 2011 
5 San Marcos   2011    12    10 1.56e6 140000      148       8   2012.
6 San Marcos   2014    11    10 1.51e6 146700       96       4   2015.
arrange_raw(txhousing, desc(sales))
# A tibble: 6 x 9
  city     year month sales     volume median listings inventory  date
  <chr>   <int> <int> <dbl>      <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Houston  2015     7  8945 2568156780 217600    23875       3.4 2016.
2 Houston  2006     6  8628 1795898108 155200    36281       5.6 2006.
3 Houston  2013     7  8468 2168720825 187800    21497       3.3 2014.
4 Houston  2015     6  8449 2490238594 222400    22311       3.2 2015.
5 Houston  2013     5  8439 2121508529 186100    20526       3.3 2013.
6 Houston  2014     6  8391 2342443127 211200    19725       2.9 2014.

arrange_raw() fails when we pass multiple raw names as a vector.

arrange_raw(txhousing, c(sales, volume))
Error in `dplyr::arrange()`:
! Problem with the implicit `transmute()` step.
x Problem while computing `..1 = c(sales, volume)`.
x `..1` must be size 8602 or 1, not 17204.

Passing multiple raw names using ... argument

To pass multiple raw names, we must use the ... argument.

arrange_raw_multiple <- function(df, ...) {
  dplyr::arrange(.data = df, ...) %>% 
    head()
}

arrange_raw_multiple(txhousing, city, sales)
# A tibble: 6 x 9
  city     year month sales  volume median listings inventory  date
  <chr>   <int> <int> <dbl>   <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Abilene  2003     1    68 5385000  70000      668       5.4  2003
2 Abilene  2011     1    68 8834493 123300      809       6.1  2011
3 Abilene  2009     1    70 8414801  92900      861       6.3  2009
4 Abilene  2000     1    72 5380000  71400      701       6.3  2000
5 Abilene  2010     1    73 9130783 112200      868       6.4  2010
6 Abilene  2001     1    75 5730000  64500      779       6.8  2001
arrange_raw_multiple(txhousing, city, desc(sales))
# A tibble: 6 x 9
  city     year month sales   volume median listings inventory  date
  <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Abilene  2015     7   268 45845730 148700      986       5   2016.
2 Abilene  2015     6   260 41396230 141500      965       5   2015.
3 Abilene  2007     7   239 29315000 114300      940       5.2 2008.
4 Abilene  2013     8   236 30777727 120000      976       5.4 2014.
5 Abilene  2014     7   231 35861350 145800     1033       5.8 2014.
6 Abilene  2005     6   230 24050000  92500      664       4.1 2005.

Pass single column name as string

arrange_str <- function(df, var, .desc = FALSE) {
  if (.desc) {
    dplyr::arrange(.data = df, desc(df[[var]])) %>% head()
  } else {
    dplyr::arrange(.data = df, df[[var]]) %>% head()
  }
}

arrange_str(txhousing, "sales")
# A tibble: 6 x 9
  city         year month sales volume median listings inventory  date
  <chr>       <int> <int> <dbl>  <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 San Marcos   2011    10     6 1.16e6 180000      163       8.3 2012.
2 Harlingen    2000     7     9 1.11e6  87500      719      30.8 2000.
3 South Padr~  2011     1     9 2.09e6 225000     1258      55.7 2011 
4 San Marcos   2011     1    10 1.48e6 140000      165       7.5 2011 
5 San Marcos   2011    12    10 1.56e6 140000      148       8   2012.
6 San Marcos   2014    11    10 1.51e6 146700       96       4   2015.
arrange_str(txhousing, "sales", .desc = TRUE)
# A tibble: 6 x 9
  city     year month sales     volume median listings inventory  date
  <chr>   <int> <int> <dbl>      <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Houston  2015     7  8945 2568156780 217600    23875       3.4 2016.
2 Houston  2006     6  8628 1795898108 155200    36281       5.6 2006.
3 Houston  2013     7  8468 2168720825 187800    21497       3.3 2014.
4 Houston  2015     6  8449 2490238594 222400    22311       3.2 2015.
5 Houston  2013     5  8439 2121508529 186100    20526       3.3 2013.
6 Houston  2014     6  8391 2342443127 211200    19725       2.9 2014.

Pass multiple column name as string

arrange_str_multiple <- function(df, var, desc = FALSE) {
  if (desc) {
    dplyr::arrange(.data = df, desc(df[var])) %>% head()
  } else {
    dplyr::arrange(.data = df, df[var]) %>% head()
  }
}

# This function arranges the dataframe either all ascending
# or all descending. Definitely need a better example.

arrange_str_multiple(txhousing, c("year", "month", "sales"))
# A tibble: 6 x 9
  city         year month sales volume median listings inventory  date
  <chr>       <int> <int> <dbl>  <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Paris        2000     1    19 1.44e6  71700      286       7.5  2000
2 San Marcos   2000     1    22 2.38e6 106700      190       6.3  2000
3 Lufkin       2000     1    28 2.28e6  68000       NA      NA    2000
4 Harlingen    2000     1    31 3.91e6  87500      644      24.9  2000
5 Galveston    2000     1    37 4.56e6  95000      636       9.1  2000
6 Port Arthur  2000     1    40 3.09e6  68300      314       5.6  2000
arrange_str_multiple(txhousing, c("year", "month", "sales"), desc = TRUE)
# A tibble: 6 x 9
  city         year month sales volume median listings inventory  date
  <chr>       <int> <int> <dbl>  <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Houston      2015     7  8945 2.57e9 217600    23875       3.4 2016.
2 Dallas       2015     7  7038 2.02e9 233000    12292       2.4 2016.
3 Austin       2015     7  3466 1.15e9 264600     7913       3   2016.
4 San Antonio  2015     7  2962 7.05e8 198100     9462       4.1 2016.
5 Collin Cou~  2015     7  1861 6.14e8 292600     2809       2.1 2016.
6 Fort Bend    2015     7  1372 4.32e8 280400     3328       3.1 2016.

group_by()

In group_by(), we select which columns to, well, group by! (Damn these well-named functions!). So one can use the same techniques as select() to choose the columns.

In the following examples, we will create only one summarised value total_sales for simplicity.

Passing single raw name

group_raw <- function(df, grp) {
  df %>% 
    group_by({{grp}}) %>% 
    summarise(total_sales = sum(sales, na.rm = TRUE),
              .groups = 'drop')  %>% 
    head(n=5)
}

group_raw(txhousing, year)        # Sum of sales per year
# A tibble: 5 x 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999
group_raw(txhousing, month)       # Sum of sales per month
# A tibble: 5 x 2
  month total_sales
  <int>       <dbl>
1     1      245924
2     2      296410
3     3      386909
4     4      397332
5     5      448968

Passing multiple raw names using the ... operator

group_raw_multiple <- function(df, ...) {
  df %>% 
    group_by(...) %>% 
    summarise(total_sales = sum(sales, na.rm = TRUE),
              .groups = 'drop')  %>% 
    head(n = 5)
}

group_raw_multiple(txhousing, year)              # Sum of sales per year
# A tibble: 5 x 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999
group_raw_multiple(txhousing, year, month)       # Sum of sales per month
# A tibble: 5 x 3
   year month total_sales
  <int> <int>       <dbl>
1  2000     1       11411
2  2000     2       15674
3  2000     3       20202
4  2000     4       18658
5  2000     5       22388

Passing single or multiple column names as character string

group_str <- function(df, grp) {
  df %>% 
    group_by(df[grp]) %>% 
    summarise(total_sales = sum(sales, na.rm = TRUE),
              .groups = 'drop')  %>% 
    head(n=5)
}

group_str(txhousing, "year")                   # Sum of sales per year
# A tibble: 5 x 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999
group_str(txhousing, c("year", "month"))       # Sum of sales per month
# A tibble: 5 x 3
   year month total_sales
  <int> <int>       <dbl>
1  2000     1       11411
2  2000     2       15674
3  2000     3       20202
4  2000     4       18658
5  2000     5       22388
# The same column names can be passed as variables containing the character names
yr <- "year"
group_str(txhousing, yr)
# A tibble: 5 x 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999
yrmon <- c("year", "month")
group_str(txhousing, yrmon)
# A tibble: 5 x 3
   year month total_sales
  <int> <int>       <dbl>
1  2000     1       11411
2  2000     2       15674
3  2000     3       20202
4  2000     4       18658
5  2000     5       22388

If you want the summarise column to have a custom name like total_<sumvar>, then you can wrap the value in quotes as below. This method uses the glue syntax enabled by the := walrus operator. The walrus operator takes either a raw name or a character string on its LHS.

group_raw2 <- function(df, grp, sumvar) {
  df %>% 
    group_by({{grp}}) %>% 
    summarise("total_{{sumvar}}" := sum({{sumvar}}, na.rm = TRUE),
              .groups = 'drop')  %>% 
    head(n=5)
}

group_raw2(txhousing, year, sales)            # Sum of sales per year
# A tibble: 5 x 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999
group_raw2(txhousing, month, listings)        # Sum of listings per month
# A tibble: 5 x 2
  month total_listings
  <int>          <dbl>
1     1        1854661
2     2        1888104
3     3        1949187
4     4        1991278
5     5        2038932

After writing so many examples, I see a pattern. group_by() works with techniques similar to select() while summarise() works with techniques similar to mutate().

(Slightly Better) Examples

The txhousing is a city-wise monthly sales and volume dataset. It has a year and month column. Let us create a date column and keep only those columns relevant for our custom tx_summary() function.

small_df <- txhousing %>% 
  mutate(date = lubridate::as_date(glue::glue("{year}-{month}-01"))) %>% 
  select(city, date, sales, volume)

mutate() example

Now let us create the create_ymq() function. This function would take 2 arguments, a data frame df and a raw name of a date column.

create_ymq <- function(df, date_col) {
  stopifnot(inherits(df, "data.frame"))
  stopifnot(class(df %>% dplyr::pull({{date_col}})) == 'Date')
  dplyr::mutate(df,
                Year = lubridate::year({{date_col}}),
                nHalf = lubridate::semester({{date_col}}),
                yHalf = lubridate::semester({{date_col}}, with_year = TRUE),
                dHalf = paste0(lubridate::semester({{date_col}}), "H", format({{date_col}},"%y")),
                nQtr = lubridate::quarter({{date_col}}),
                yQtr = lubridate::quarter({{date_col}}, with_year = TRUE),
                dQtr = paste0(lubridate::quarter({{date_col}}),"Q", format({{date_col}},"%y")),
                Month = lubridate::month({{date_col}}),
                yMonth = as.numeric(format({{date_col}}, "%Y.%m")),
                dMonth = format({{date_col}}, "%b %Y")
                )
}

create_ymq(df = small_df, date_col = date) %>% glimpse()
Rows: 8,602
Columns: 14
$ city   <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene"~
$ date   <date> 2000-01-01, 2000-02-01, 2000-03-01, 2000-04-01, 2000~
$ sales  <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 9~
$ volume <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 1391000~
$ Year   <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,~
$ nHalf  <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,~
$ yHalf  <dbl> 2000.1, 2000.1, 2000.1, 2000.1, 2000.1, 2000.1, 2000.~
$ dHalf  <chr> "1H00", "1H00", "1H00", "1H00", "1H00", "1H00", "2H00~
$ nQtr   <int> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 1, 2, 2, 2,~
$ yQtr   <dbl> 2000.1, 2000.1, 2000.1, 2000.2, 2000.2, 2000.2, 2000.~
$ dQtr   <chr> "1Q00", "1Q00", "1Q00", "2Q00", "2Q00", "2Q00", "3Q00~
$ Month  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5,~
$ yMonth <dbl> 2000.01, 2000.02, 2000.03, 2000.04, 2000.05, 2000.06,~
$ dMonth <chr> "Jan 2000", "Feb 2000", "Mar 2000", "Apr 2000", "May ~

group_by() example

Now that we have a function that creates various date-related columns, let us create a function that let’s you create summary tables like annual sales per city, quarterly volumes per city etc.

tx_summary <- function(df, grp_col, sum_col) {
  df %>% 
    group_by(city, {{grp_col}}) %>% 
    summarise("total_{{sum_col}}" := sum({{sum_col}}, na.rm = TRUE), .groups = 'drop')
}

Using these 2 functions, we can now create multiple summary tables

small_df_with_date_cols <- small_df %>% create_ymq(date_col = date)

# Annual Sales per city
small_df_with_date_cols %>% tx_summary(grp_col = Year, sum_col = sales)
# A tibble: 736 x 3
   city     Year total_sales
   <chr>   <dbl>       <dbl>
 1 Abilene  2000        1375
 2 Abilene  2001        1431
 3 Abilene  2002        1516
 4 Abilene  2003        1632
 5 Abilene  2004        1830
 6 Abilene  2005        1977
 7 Abilene  2006        1997
 8 Abilene  2007        2003
 9 Abilene  2008        1651
10 Abilene  2009        1634
# ... with 726 more rows
# Half Yearly volumes per city
small_df_with_date_cols %>% tx_summary(grp_col = yHalf, sum_col = volume)
# A tibble: 1,472 x 3
   city    yHalf total_volume
   <chr>   <dbl>        <dbl>
 1 Abilene 2000.     55400000
 2 Abilene 2000.     53175000
 3 Abilene 2001.     55795000
 4 Abilene 2001.     58570000
 5 Abilene 2002.     55305000
 6 Abilene 2002.     63370000
 7 Abilene 2003.     58175000
 8 Abilene 2003.     77500000
 9 Abilene 2004.     74205000
10 Abilene 2004.     85465000
# ... with 1,462 more rows
# Quarterly Sales per city
small_df_with_date_cols %>% tx_summary(grp_col = yQtr, sum_col = sales)
# A tibble: 2,898 x 3
   city     yQtr total_sales
   <chr>   <dbl>       <dbl>
 1 Abilene 2000.         300
 2 Abilene 2000.         395
 3 Abilene 2000.         387
 4 Abilene 2000.         293
 5 Abilene 2001.         305
 6 Abilene 2001.         394
 7 Abilene 2001.         401
 8 Abilene 2001.         331
 9 Abilene 2002.         295
10 Abilene 2002.         425
# ... with 2,888 more rows
# Monthly Volumes per city
small_df_with_date_cols %>% tx_summary(grp_col = yMonth, sum_col = volume)
# A tibble: 8,602 x 3
   city    yMonth total_volume
   <chr>    <dbl>        <dbl>
 1 Abilene  2000.      5380000
 2 Abilene  2000.      6505000
 3 Abilene  2000.      9285000
 4 Abilene  2000.      9730000
 5 Abilene  2000.     10590000
 6 Abilene  2000.     13910000
 7 Abilene  2000.     12635000
 8 Abilene  2000.     10710000
 9 Abilene  2000.      7615000
10 Abilene  2000.      7040000
# ... with 8,592 more rows

More ideas

You could further extend this by creating a custom filtering function that gives you, say, the rows with the highest or lowest total_sales or total_volume.

Conclusion

The ability to create such dynamic functions, enabled by the wonderful {dplyr} package, allows us to level-up in terms of programming with R and helps make our code neat and tidy.

How I feel while creating custom functions with {dplyr}! I can almost hear the music! Source: gifer.com

References